Introduction

The aim of this study is to conduct a market basket analysis using association rules, an unsupervised machine learning technique, mined with the Apriori algorithm. This kind of analysis is useful for learning about consumer preferences: knowing that a consumer has already reached for a particular product or set of products, we can estimate which other products he or she is more likely to reach for. With this knowledge, a company can plan its sales or discounting policy for specific products more effectively.

About the data set

The data set used in this analysis can be found on Kaggle. It contains basket data for 7501 transactions of 119 unique products.

Preliminary data analysis

Libraries necessary for the analysis

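library(knitr)      # table rendering
library(arules)     # transactions class, apriori(), inspect()
library(arulesViz)  # plot methods for association rules
# Each line of the CSV is one basket; items are separated by commas
baskets <- read.transactions("Data/Market_Basket_Optimisation.csv", sep = ",")
summary(baskets)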
## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus

Our data contains 7501 rows, each representing a single transaction, and 119 columns corresponding to unique products. The largest transaction contains 20 products, and the most frequently purchased product is mineral water. The median basket holds 3 products (the mean is 3.9), and three quarters of transactions contain 5 products or fewer.
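
The summary statistics can be cross-checked from the basket-size distribution printed by summary(). The sketch below uses base R only; the vector names are illustrative, and the counts are copied by hand from the output above rather than read from the transactions object.

```r
# Basket sizes and their frequencies, copied from the summary() output
sizes  <- c(1:16, 18, 19, 20)
counts <- c(1754, 1358, 1044, 816, 667, 493, 391, 324, 259, 139,
            102, 67, 40, 22, 17, 4, 1, 2, 1)

# Expand to one entry per transaction
basket_sizes <- rep(sizes, times = counts)

n_transactions <- length(basket_sizes)   # number of baskets
n_items        <- 119                    # unique products (columns)

mean_size   <- mean(basket_sizes)
median_size <- median(basket_sizes)
share_le_5  <- mean(basket_sizes <= 5)   # share of baskets with at most 5 items

# Density = filled cells / all cells of the transaction-by-item matrix
density <- sum(basket_sizes) / (n_transactions * n_items)

print(c(mean = mean_size, median = median_size,
        share_le_5 = share_le_5, density = density))
```

Density multiplied by the number of items gives the mean basket size, which is why a density of about 0.033 over 119 products corresponds to roughly 3.9 products per transaction.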

20 most frequent products

head(sort(itemFrequency(baskets, type="absolute"), decreasing = T), 20)
##     mineral water              eggs         spaghetti      french fries 
##              1788              1348              1306              1282 
##         chocolate         green tea              milk       ground beef 
##              1229               991               972               737 
## frozen vegetables          pancakes           burgers              cake 
##               715               713               654               608 
##           cookies          escalope    low fat yogurt            shrimp 
##               603               595               574               536 
##          tomatoes         olive oil   frozen smoothie            turkey 
##               513               494               475               469

20 least frequent products

head(sort(itemFrequency(baskets, type="absolute"), decreasing = F), 20)
##      water spray          napkins            cream          bramble 
##                3                5                7               14 
##              tea          chutney    mashed potato  chocolate bread 
##               29               31               31               32 
##     dessert wine          ketchup          oatmeal      babies food 
##               33               33               33               34 
##         sandwich        asparagus      cauliflower             corn 
##               34               36               36               36 
##            salad          shampoo hand protein bar   mint green tea 
##               37               37               39               42
itemFrequencyPlot(baskets, topN = 15, main = "Support of 15 most frequent products")

The plot above shows the 15 most frequently purchased products on a relative scale, also known as “support”. More specifically, support is the share of transactions in the data set that contain a given item, itemset or rule. Only 7 products have support above 10%: mineral water, eggs, spaghetti, french fries, chocolate, green tea and milk.
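
Relative support is the absolute count divided by the number of transactions, so the frequency table can be converted by hand. A minimal sketch in base R, with counts copied from the table of the 20 most frequent products above (the vector name item_counts is illustrative):

```r
n_transactions <- 7501

# Absolute frequencies of the ten most frequent products (from the table above)
item_counts <- c("mineral water" = 1788, "eggs" = 1348, "spaghetti" = 1306,
                 "french fries" = 1282, "chocolate" = 1229, "green tea" = 991,
                 "milk" = 972, "ground beef" = 737,
                 "frozen vegetables" = 715, "pancakes" = 713)

# Support = share of transactions containing the item
item_support <- item_counts / n_transactions

# Products whose support exceeds 10%
above_10pct <- names(item_support)[item_support > 0.10]

print(round(item_support, 4))
print(above_10pct)
```

With the transactions object itself, the same values should come from itemFrequency(baskets), whose default type is relative.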

Apriori algorithm analysis

Let’s run the Apriori algorithm on our data, setting the minimum support to 0.01 and the minimum confidence to 0.4, and inspect the resulting rules sorted by support, confidence and lift in turn.

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 75 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [18 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
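
To give an idea of what happens behind the log above, here is a toy, base R sketch of the level-wise search that Apriori is known for: candidate itemsets of size k are built only from frequent itemsets of size k-1, because a set cannot be frequent if one of its subsets is not. The five baskets, the threshold min_count and all helper names are made up for illustration; this is not how arules implements the search, and it stops at frequent itemsets without generating rules.

```r
# Hypothetical toy baskets (not taken from the Kaggle data)
transactions <- list(
  c("beef", "water", "spaghetti"),
  c("beef", "water"),
  c("water", "eggs"),
  c("beef", "spaghetti"),
  c("water", "spaghetti", "eggs")
)
min_count <- 2  # absolute minimum support count, analogous to the 75 in the log

# Number of baskets containing every item of the itemset
count_of <- function(itemset) {
  sum(vapply(transactions, function(b) all(itemset %in% b), logical(1)))
}
key <- function(itemset) paste(itemset, collapse = ",")

# Level 1: frequent single items
items        <- sort(unique(unlist(transactions)))
frequent     <- Filter(function(s) count_of(s) >= min_count, as.list(items))
all_frequent <- frequent

k <- 2
while (length(frequent) > 0) {
  pool <- sort(unique(unlist(frequent)))
  if (length(pool) < k) break
  candidates <- combn(pool, k, simplify = FALSE)

  # Pruning step: keep a candidate only if all its (k-1)-subsets are frequent
  known <- vapply(frequent, key, character(1))
  candidates <- Filter(function(cand) {
    subsets <- combn(cand, k - 1, simplify = FALSE)
    all(vapply(subsets, key, character(1)) %in% known)
  }, candidates)

  # Counting step: scan the baskets only for the surviving candidates
  frequent     <- Filter(function(s) count_of(s) >= min_count, candidates)
  all_frequent <- c(all_frequent, frequent)
  k <- k + 1
}

print(vapply(all_frequent, key, character(1)))
```

The pruning step is what keeps the search tractable on real data: a candidate that fails the subset check is discarded without scanning the baskets at all.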
r1 <- inspect(sort(rules1 , by = "support")[1:5])

Rules sorted by support

    lhs                         rhs              support    confidence  coverage   lift      count
[1] {ground beef}            => {mineral water}  0.0409279  0.4165536   0.0982536  1.747521  307
[2] {olive oil}              => {mineral water}  0.0275963  0.4190283   0.0658579  1.757904  207
[3] {soup}                   => {mineral water}  0.0230636  0.4564644   0.0505266  1.914955  173
[4] {salmon}                 => {mineral water}  0.0170644  0.4012539   0.0425277  1.683337  128
[5] {ground beef, spaghetti} => {mineral water}  0.0170644  0.4353741   0.0391948  1.826477  128

The rule with the highest support in our data set is {ground beef} => {mineral water}: the two products appear together in about 4.1% of all transactions, and about 41.7% of baskets containing ground beef also contain mineral water.

r2 <- inspect(sort(rules1 , by = "confidence")[1:5])

Rules sorted by confidence

    lhs                          rhs              support    confidence  coverage   lift      count
[1] {eggs, ground beef}       => {mineral water}  0.0101320  0.5066667   0.0199973  2.125563  76
[2] {ground beef, milk}       => {mineral water}  0.0110652  0.5030303   0.0219971  2.110308  83
[3] {chocolate, ground beef}  => {mineral water}  0.0109319  0.4739884   0.0230636  1.988472  82
[4] {frozen vegetables, milk} => {mineral water}  0.0110652  0.4689266   0.0235969  1.967236  83
[5] {soup}                    => {mineral water}  0.0230636  0.4564644   0.0505266  1.914955  173

When analysing association rules, we should also pay attention to confidence. It tells us what share of the transactions containing an item or set X also contain an item or set Y. In our data, the rule with the highest confidence is {eggs, ground beef} => {mineral water}: about 50.7% of baskets containing both eggs and ground beef also contain mineral water.
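
As a quick check on the definition, confidence can be recomputed from the other columns of the table: it is the support of the whole rule divided by the support of its left-hand side (reported as coverage). A small base R sketch using the top rule above; the helper name rule_confidence is illustrative and not part of arules:

```r
# confidence(X => Y) = support(X and Y together) / support(X)
rule_confidence <- function(support_xy, support_x) support_xy / support_x

n_transactions <- 7501

# {eggs, ground beef} => {mineral water}, values from the table above
count_xy   <- 76          # baskets with eggs, ground beef and mineral water
coverage_x <- 0.0199973   # share of baskets with eggs and ground beef

support_xy <- count_xy / n_transactions
conf <- rule_confidence(support_xy, coverage_x)
print(round(conf, 4))

# Equivalent view in counts: baskets containing the left-hand side
count_x <- round(coverage_x * n_transactions)
print(count_xy / count_x)
```

Seen in counts, the rule says that 76 of the baskets containing eggs and ground beef also contain mineral water, which is how a rule with only about 1% support can still reach a confidence of roughly one half.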

r3 <- inspect(sort(rules1 , by = "lift")[1:5])

Rules sorted by lift

    lhs                             rhs              support    confidence  coverage   lift      count
[1] {ground beef, mineral water} => {spaghetti}      0.0170644  0.4169381   0.0409279  2.394681  128
[2] {eggs, ground beef}          => {mineral water}  0.0101320  0.5066667   0.0199973  2.125563  76
[3] {ground beef, milk}          => {mineral water}  0.0110652  0.5030303   0.0219971  2.110308  83
[4] {chocolate, ground beef}     => {mineral water}  0.0109319  0.4739884   0.0230636  1.988472  82
[5] {frozen vegetables, milk}    => {mineral water}  0.0110652  0.4689266   0.0235969  1.967236  83

Lift is also an important measure in the study of association rules. It assesses the strength of the relationship between the two sides of a rule, X and Y, and is defined as the ratio of the observed support for the itemset (the presence of both X and Y) to the support expected if X and Y were independent. A lift value greater than 1 indicates a positive association, meaning that the presence of X increases the likelihood of Y also being present. A lift value less than 1 indicates a negative association, and a lift value of 1 indicates independence. Our results show that a basket containing ground beef and mineral water is about 2.4 times more likely to also contain spaghetti than an average basket.
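
The definition above translates directly into arithmetic on counts. The sketch below recomputes the lift of the top rule from figures printed earlier in this document (128 joint baskets, 307 baskets with ground beef and mineral water, 1306 baskets with spaghetti); rule_lift is an illustrative helper, not an arules function.

```r
# lift(X => Y) = support(X and Y) / (support(X) * support(Y))
rule_lift <- function(count_xy, count_x, count_y, n) {
  (count_xy / n) / ((count_x / n) * (count_y / n))
}

n_transactions <- 7501

# {ground beef, mineral water} => {spaghetti}
lift_top <- rule_lift(count_xy = 128,   # all three products together
                      count_x  = 307,   # ground beef and mineral water
                      count_y  = 1306,  # spaghetti
                      n        = n_transactions)
print(round(lift_top, 4))

# Same number seen as confidence relative to the baseline share of spaghetti
confidence <- 128 / 307
baseline   <- 1306 / n_transactions
print(round(confidence / baseline, 4))
```

The second form is often the easier one to read: lift compares how often Y appears in baskets that contain X with how often Y appears in all baskets.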

Visualisation of 10 strongest rules

plot(rules1, engine = "visNetwork", method="graph", limit = 10)
plot(rules1, engine = "default", method="paracoord", limit = 10, main = "Parallel coordinates plot for 10 strongest rules")

Analysis of the rules for french fries

We can also check what drives consumers to buy a particular product. Let’s check for french fries.

rules.frenchFries.rhs <- apriori(data=baskets, parameter=list(supp=0.01,conf = 0.04), 
                             appearance=list(default="lhs", rhs="french fries"), control=list(verbose=F)) 
rules.frenchFries.rhs.bylift<-sort(rules.frenchFries.rhs, by="lift", decreasing=TRUE)
f1 <- inspect(head(rules.frenchFries.rhs.bylift))
    lhs                  rhs             support    confidence  coverage   lift      count
[1] {burgers}         => {french fries}  0.0219971  0.2522936   0.0871884  1.476173  165
[2] {frozen smoothie} => {french fries}  0.0145314  0.2294737   0.0633249  1.342654  109
[3] {cake}            => {french fries}  0.0178643  0.2203947   0.0810559  1.289533  134
[4] {green tea}       => {french fries}  0.0285295  0.2159435   0.1321157  1.263488  214
[5] {pancakes}        => {french fries}  0.0201306  0.2117812   0.0950540  1.239135  151
[6] {chocolate}       => {french fries}  0.0343954  0.2099268   0.1638448  1.228285  258

The results show that consumers are somewhat more likely than average to reach for french fries when their basket also contains, among others, burgers, frozen smoothie or cake.

plot(rules.frenchFries.rhs, engine = "htmlwidget",  method="grouped")

And the opposite situation: what else will a consumer buy if french fries are already in the basket?

rules.frenchFries.lhs <- apriori(data=baskets, parameter=list(supp=0.01,conf = 0.04), 
                                 appearance=list(default="rhs", lhs="french fries"), control=list(verbose=F)) 
rules.frenchFries.lhs.bylift<-sort(rules.frenchFries.lhs, by="lift", decreasing=TRUE)
f2 <- inspect(head(rules.frenchFries.lhs.bylift))
    lhs               rhs                support    confidence  coverage   lift      count
[1] {french fries} => {burgers}          0.0219971  0.1287051   0.1709105  1.476173  165
[2] {french fries} => {frozen smoothie}  0.0145314  0.0850234   0.1709105  1.342654  109
[3] {french fries} => {cake}             0.0178643  0.1045242   0.1709105  1.289533  134
[4] {french fries} => {green tea}        0.0285295  0.1669267   0.1709105  1.263488  214
[5] {french fries} => {pancakes}         0.0201306  0.1177847   0.1709105  1.239135  151
[6] {french fries} => {chocolate}        0.0343954  0.2012480   0.1709105  1.228285  258

The results show that consumers are somewhat more likely than average to reach for products such as burgers, frozen smoothie or cake when they have french fries in their basket.

plot(rules.frenchFries.lhs, engine = "htmlwidget",  method="grouped")
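
The two french fries tables report identical lift values for each pair of products but different confidences. That follows from the definitions: lift treats both sides symmetrically, while confidence divides by the frequency of the left-hand side only. A short base R check for burgers and french fries, using counts printed earlier (variable names are illustrative):

```r
n_transactions <- 7501
count_burgers  <- 654    # from the 20 most frequent products
count_fries    <- 1282   # from the 20 most frequent products
count_both     <- 165    # baskets containing both products

# Confidence depends on which product is on the left-hand side
conf_burgers_to_fries <- count_both / count_burgers
conf_fries_to_burgers <- count_both / count_fries

# Lift is the same in both directions
lift_burgers_to_fries <- conf_burgers_to_fries / (count_fries / n_transactions)
lift_fries_to_burgers <- conf_fries_to_burgers / (count_burgers / n_transactions)

print(round(c(conf_burgers_to_fries, conf_fries_to_burgers), 4))
print(round(c(lift_burgers_to_fries, lift_fries_to_burgers), 4))
```

Because of this symmetry, lift alone cannot say which product prompts the other; it only says that the two appear together more often than independence would predict.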

Summary

In this article, a market basket analysis has been carried out using association rules, an unsupervised machine learning technique, mined with the Apriori algorithm. Mineral water proved to be the most frequently purchased item and appears in most of the rules, and the strongest rule measured by lift turned out to be: having ground beef and mineral water in the basket increases the likelihood of also buying spaghetti. In addition, the association rules between french fries and other products were examined in detail. The results indicated that the purchase of french fries slightly increases the likelihood of purchasing, among others, burgers, frozen smoothie and cake, and that this relationship is reciprocal. The above analysis has provided insight into how consumers’ product choices are associated, and may prove useful in planning a sales or discounting policy for specific products.